Predictive Reliability and Fault Management in Exascale Systems
Similar resources
Introspective Fault Tolerance for Exascale Systems
Faults and errors are an unavoidable aspect of high performance computing systems. Emerging exascale systems will contain billions of hardware components and complex software stacks. In addition, higher fabrication density and power challenges will further compound fault detection, management and recovery. Efficient fault tolerance and resiliency frameworks are thus of immense importance in the...
Exploring reliability of exascale systems through simulations
Exascale computers are predicted to emerge by the end of this decade with millions of nodes and billions of concurrent cores/threads. One of the most critical challenges for exascale computing is how to effectively and efficiently maintain the system reliability. Checkpointing is the state-of-the-art technique for high-end computing system reliability that has proved to work well for current pet...
Total order broadcast for fault tolerant exascale systems
In the process of designing a new fault tolerant run-time for future exascale systems, we discovered that a total order broadcast would be necessary. That is, nodes of a supercomputer should be able to broadcast messages to other nodes even in the face of failures. All messages should be seen in the same order at all nodes. While this is a well studied problem in distributed systems, few resear...
Restoring Reliability in Fault Tolerant Reconfigurable Systems
The new generations of SRAM-based FPGA devices, built on nanometer technology, are the preferred choice for the implementation of reconfigurable computing platforms. However, smaller technological scales increase their vulnerability to manufacturing imperfections and hence to the occurrence of electromigration. Moreover, the large internal RAM (for configuration purposes or as embedded memory b...
Power Management for Exascale
Most performance studies of large-scale HPC systems and their workloads have focused primarily on flops, bandwidth, and latency. Few concrete studies exist that focus on quantifying power and energy consumption at the hardware and software levels. Until recently, system vendors have had little incentive to expose extensive system and component-level power interfaces to users. Consequently, the ...
Journal
Journal title: ACM Computing Surveys
Year: 2020
ISSN: 0360-0300,1557-7341
DOI: 10.1145/3403956